---
displayed_sidebar: "English"
---

# Generated columns

Since v3.1, StarRocks supports generated columns, which can be used to accelerate queries with complex expressions. This feature precomputes and stores the results of expressions, and supports [query rewrites](#query-rewrites), which significantly accelerates queries that contain the same complex expressions.

You can define one or more generated columns at table creation to store the results of expressions. When a query contains an expression whose result is stored in a generated column you have defined, the CBO rewrites the query to read the data directly from the generated column. Alternatively, you can query the generated column directly.

It is also recommended that you **evaluate the impact of generated columns on loading performance, because computing the expressions takes some time**. Additionally, it is advised to **[create generated columns at table creation](#create-generated-columns-at-table-creation-recommended) rather than adding or modifying them after table creation**, because adding or modifying generated columns after table creation is time-consuming and costly.

## Basic operations

### Create generated columns

#### Syntax

```SQL
<col_name> <data_type> [NULL] AS <expr> [COMMENT 'string']
```

#### Create generated columns at table creation (Recommended)

Create a table named `test_tbl1` that contains five columns, among which `newcol1` and `newcol2` are generated columns. Their values are computed by applying the specified expressions to the regular columns `data_array` and `data_json`, respectively.

```SQL
CREATE TABLE test_tbl1
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NOT NULL,
    newcol1 DOUBLE AS array_avg(data_array),
    newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
```

**NOTICE**:

- Generated columns must be defined after regular columns.
- Aggregate functions cannot be used in the expressions for generated columns.
- The expressions for generated columns cannot reference other generated columns or [auto-increment columns](./auto_increment.md), but they can reference multiple regular columns.
- The data type of a generated column must match the data type of the result generated by the expression for the generated column.
- Generated columns cannot be created on Aggregate tables.
- Currently, StarRocks's shared-data mode does not support generated columns.
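
To build intuition for what StarRocks stores in `newcol1` and `newcol2`, the following Python sketch approximates the two expressions outside StarRocks (an illustration only, not StarRocks code) for the sample row used later in this document:

```python
import json
from statistics import mean

# Hypothetical row values matching the INSERT examples in this document.
data_array = [1, 2]
data_json = {"a": 1, "b": 2}

# array_avg(data_array): the average of the array elements.
newcol1 = mean(data_array)

# json_string(json_query(data_json, "$.a")): field "a" rendered as a string.
newcol2 = json.dumps(data_json["a"])

print(newcol1, newcol2)  # → 1.5 1
```

These are the same values (`1.5` and `"1"`) that appear in the query results throughout the examples below.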

#### Add generated columns after table creation

> **NOTICE**
>
> This operation is time-consuming and resource-intensive. Therefore, it is recommended to add generated columns at table creation. If it is unavoidable to use ALTER TABLE to add generated columns, it is recommended to evaluate the cost and time involved in advance.

1. Create a table named `test_tbl2` with three regular columns `id`, `data_array`, and `data_json`. Insert a data row into the table.

    ```SQL
    -- Create a table.
    CREATE TABLE test_tbl2
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);

    -- Insert a data row.
    INSERT INTO test_tbl2 VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

    -- Query the table.
    MySQL [example_db]> select * from test_tbl2;
    +------+------------+------------------+
    | id   | data_array | data_json        |
    +------+------------+------------------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |
    +------+------------+------------------+
    1 row in set (0.04 sec)
    ```

2. Execute ALTER TABLE ... ADD COLUMN ... to add generated columns `newcol1` and `newcol2`, which are created by evaluating the expressions based on the values of regular columns `data_array` and `data_json`.

    ```SQL
    ALTER TABLE test_tbl2
    ADD COLUMN newcol1 DOUBLE AS array_avg(data_array);

    ALTER TABLE test_tbl2
    ADD COLUMN newcol2 String AS json_string(json_query(data_json, "$.a"));
    ```

    **NOTICE**:

    - Adding generated columns to Aggregate tables is not supported.
    - Regular columns need to be defined before generated columns. When you use the ALTER TABLE ... ADD COLUMN ... statement to add a regular column without specifying the position of the new regular column, the system automatically places it before the generated columns. Moreover, you cannot use AFTER to explicitly place the regular column after a generated column.

3. Query the table data.

    ```SQL
    MySQL [example_db]> SELECT * FROM test_tbl2;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.04 sec)
    ```

    The result shows that the generated columns `newcol1` and `newcol2` are added to the table, and that StarRocks automatically computes their values based on the expressions.

### Load data into generated columns

During data loading, StarRocks automatically calculates the values for generated columns based on the expressions. You cannot specify the values of generated columns. The following example uses the [INSERT INTO](../../loading/InsertInto.md) statement to load data:

1. Use INSERT INTO to insert a record into the `test_tbl1` table. Note that you cannot specify the values for the generated columns within the `VALUES ()` clause.

    ```SQL
    INSERT INTO test_tbl1 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
    ```

2. Query the table data.

    ```SQL
    MySQL [example_db]> SELECT * FROM test_tbl1;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)
    ```

    The results show that StarRocks automatically computes the values for the generated columns `newcol1` and `newcol2` based on the expressions.

    **NOTICE**:

    The following error is returned if you specify values for the generated columns during data loading:

    ```SQL
    MySQL [example_db]> INSERT INTO test_tbl1 (id, data_array, data_json, newcol1, newcol2) 
    VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
    ERROR 1064 (HY000): Getting analyzing error. Detail message: materialized column 'newcol1' can not be specified.

    MySQL [example_db]> INSERT INTO test_tbl1 VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
    ERROR 1064 (HY000): Getting analyzing error. Detail message: Column count doesn't match value count.
    ```
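
    Conceptually, the load-time check behaves like the following toy Python sketch (an assumption for illustration; the function name and message format are hypothetical, not StarRocks internals):

    ```python
    # Generated columns defined on test_tbl1.
    GENERATED_COLUMNS = {"newcol1", "newcol2"}

    def validate_load_columns(columns):
        """Raise an error if the load's column list names a generated column."""
        bad = GENERATED_COLUMNS & set(columns)
        if bad:
            raise ValueError(
                f"materialized column '{sorted(bad)[0]}' can not be specified"
            )

    validate_load_columns(["id", "data_array", "data_json"])  # accepted
    # validate_load_columns(["id", "data_array", "newcol1"])  # would raise
    ```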

### Modify generated columns

> **NOTICE**
>
> This operation is time-consuming and resource-intensive. If it is unavoidable to use ALTER TABLE to modify generated columns, it is recommended to evaluate the cost and time involved in advance.

You can modify the data type and expression of a generated column.

1. Create a table `test_tbl3` that contains five columns, among which `newcol1` and `newcol2` are generated columns. Their values are computed by applying the specified expressions to the regular columns `data_array` and `data_json`, respectively. Insert a data row into the table.

    ```SQL
    -- Create a table.
    MySQL [example_db]> CREATE TABLE test_tbl3
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL,
        -- The data types and expressions of generated columns are specified as follows:
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);

    -- Insert a data row.
    INSERT INTO test_tbl3 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

    -- Query the table.
    MySQL [example_db]> select * from test_tbl3;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)
    ```

2. Modify the generated columns `newcol1` and `newcol2`:

    - Change the data type of the generated column `newcol1` to `ARRAY<INT>` and change its expression to `data_array`.

        ```SQL
        ALTER TABLE test_tbl3 
        MODIFY COLUMN newcol1 ARRAY<INT> AS data_array;
        ```

    - Modify the expression of the generated column `newcol2` to extract the values of field `b` from the regular column `data_json`.

        ```SQL
        ALTER TABLE test_tbl3
        MODIFY COLUMN newcol2 String AS json_string(json_query(data_json, "$.b"));
        ```

3. View the modified schema and the data in the table.

    - View the modified schema.

        ```SQL
        MySQL [example_db]> show create table test_tbl3\G
        **** 1. row ****
            Table: test_tbl3
        Create Table: CREATE TABLE test_tbl3 (
        id int(11) NOT NULL COMMENT "",
        data_array array<int(11)> NOT NULL COMMENT "",
        data_json json NOT NULL COMMENT "",
        -- After modification, the data types and expressions of generated columns are as follows:
        newcol1 array<int(11)> NULL AS example_db.test_tbl3.data_array COMMENT "",
        newcol2 varchar(65533) NULL AS json_string(json_query(example_db.test_tbl3.data_json, '$.b')) COMMENT ""
        ) ENGINE=OLAP 
        PRIMARY KEY(id)
        DISTRIBUTED BY HASH(id)
        PROPERTIES (...);
        1 row in set (0.00 sec)
        ```

    - Query the table data after modification. The result shows that StarRocks recalculates the values of the generated columns `newcol1` and `newcol2` based on the modified expressions.

        ```SQL
        MySQL [example_db]> select * from test_tbl3;
        +------+------------+------------------+---------+---------+
        | id   | data_array | data_json        | newcol1 | newcol2 |
        +------+------------+------------------+---------+---------+
        |    1 | [1,2]      | {"a": 1, "b": 2} | [1,2]   | 2       |
        +------+------------+------------------+---------+---------+
        1 row in set (0.01 sec)
        ```

### Drop a generated column

Drop the column `newcol1` from the table `test_tbl3`:

```SQL
ALTER TABLE test_tbl3 DROP COLUMN newcol1;
```

> **NOTICE**:
>
> If a generated column references a regular column in its expression, you cannot directly drop or modify that regular column. Instead, you must first drop the generated column and then drop or modify the regular column.

### Query rewrites

If the expression in a query matches the expression of a generated column, the optimizer automatically rewrites the query to directly read the values of the generated column.

1. Suppose that you create a table `test_tbl4` with the following schema:

    ```SQL
    CREATE TABLE test_tbl4
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL,
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id) DISTRIBUTED BY HASH(id);
    ```

2. If you query the data in the table `test_tbl4` by using the `SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;` statement, the query involves only the regular columns `data_array` and `data_json`. However, the expressions in the query match the expressions of the generated columns `newcol1` and `newcol2`. In this case, the execution plan shows that the CBO automatically rewrites the query to read the values of the generated columns `newcol1` and `newcol2`.

    ```SQL
    MySQL [example_db]> EXPLAIN SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;
    +---------------------------------------+
    | Explain String                        |
    +---------------------------------------+
    | PLAN FRAGMENT 0                       |
    |  OUTPUT EXPRS:4: newcol1 | 5: newcol2 | -- The query is rewritten to read data directly from the generated columns newcol1 and newcol2.
    |   PARTITION: RANDOM                   |
    |                                       |
    |   RESULT SINK                         |
    |                                       |
    |   0:OlapScanNode                      |
    |      TABLE: test_tbl4                 |
    |      PREAGGREGATION: ON               |
    |      partitions=0/1                   |
    |      rollup: test_tbl4                |
    |      tabletRatio=0/0                  |
    |      tabletList=                      |
    |      cardinality=1                    |
    |      avgRowSize=2.0                   |
    +---------------------------------------+
    15 rows in set (0.00 sec)
    ```
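
    At a high level, the rewrite can be thought of as matching each query expression against the generated-column definitions and substituting a direct column reference on a match. The following toy Python sketch illustrates the idea (it is not the actual CBO implementation, which matches expression trees rather than strings):

    ```python
    # Map each generated column's defining expression to its column name.
    GENERATED = {
        "array_avg(data_array)": "newcol1",
        'json_string(json_query(data_json, "$.a"))': "newcol2",
    }

    def rewrite(select_exprs):
        """Replace expressions that exactly match a generated column's definition."""
        return [GENERATED.get(expr, expr) for expr in select_exprs]

    print(rewrite(["array_avg(data_array)", "id"]))  # → ['newcol1', 'id']
    ```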

### Partial updates and generated columns

To perform partial updates on a Primary Key table, you must specify all the regular columns referenced by the generated columns in the `columns` parameter. The following example uses Stream Load to perform partial updates.

1. Create a table `test_tbl5` that contains five columns, among which `newcol1` and `newcol2` are generated columns. Their values are computed by applying the specified expressions to the regular columns `data_array` and `data_json`, respectively. Insert a data row into the table.

    ```SQL
    -- Create a table.
    CREATE TABLE test_tbl5
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NULL,
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);

    -- Insert a data row.
    INSERT INTO test_tbl5 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));

    -- Query the table.
    MySQL [example_db]> select * from test_tbl5;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)
    ```

2. Prepare a CSV file `my_data1.csv` to update some columns in the `test_tbl5` table.

    ```csv
    1|[3,4]|{"a": 3, "b": 4}
    2|[3,4]|{"a": 3, "b": 4}
    ```

3. Use [Stream Load](../../loading/StreamLoad.md) with the `my_data1.csv` file to update some columns of the `test_tbl5` table. You need to set `partial_update:true` and specify all the regular columns referenced by the generated columns in the `columns` parameter.

    ```Bash
    curl --location-trusted -u <username>:<password> -H "label:1" \
        -H "column_separator:|" \
        -H "partial_update:true" \
        -H "columns:id,data_array,data_json" \
        -T my_data1.csv -XPUT \
        http://<fe_host>:<fe_http_port>/api/example_db/test_tbl5/_stream_load
    ```

4. Query the table data.

    ```SQL
    [example_db]> select * from test_tbl5;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
    |    2 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
    +------+------------+------------------+---------+---------+
    2 rows in set (0.01 sec)
    ```

Stream Load returns an error if you perform partial updates without specifying all the regular columns referenced by the generated columns.

1. Prepare a CSV file `my_data2.csv`.

    ```csv
    1|[3,4]
    2|[3,4]
    ```

2. If you perform partial updates by using [Stream Load](../../loading/StreamLoad.md) with the `my_data2.csv` file, and the `columns` parameter in the Stream Load job does not include the `data_json` column (whose values are also not provided in `my_data2.csv`), Stream Load returns an error even though the `data_json` column allows null values, because the column `data_json` is referenced by the generated column `newcol2`.
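
The rule can be summarized as follows: the set of columns supplied by a partial update must cover every regular column referenced by any generated column. The following toy Python check illustrates the rule (for illustration only; the structures and names are hypothetical):

```python
# Regular columns referenced by each generated column in test_tbl5.
GENERATED_REFS = {"newcol1": {"data_array"}, "newcol2": {"data_json"}}

def missing_columns(supplied):
    """Return the referenced regular columns that the partial update omits."""
    supplied = set(supplied)
    missing = set()
    for refs in GENERATED_REFS.values():
        missing |= refs - supplied
    return missing

print(missing_columns(["id", "data_array"]))               # data_json omitted: load fails
print(missing_columns(["id", "data_array", "data_json"]))  # nothing omitted: load succeeds
```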
